{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "afd55886-5f5b-4794-838e-ef8179fb0394",
   "metadata": {},
   "source": [
    "##### **** Adapt these pip installs to the appropriate release level. Alternatively, pre-configure the venv that runs JupyterLab with a requirements file that pins the right release. Example for transform developers working from a git clone:\n",
    "```\n",
    "make venv\n",
    "source venv/bin/activate && pip install jupyterlab\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "## For reference only\n",
    "# Users and application developers must use the right tag for the latest release from PyPI\n",
    "#!pip install data-prep-toolkit\n",
    "#!pip install data-prep-toolkit-transforms"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ebf1f782-0e61-485c-8670-81066beb734c",
   "metadata": {},
   "source": [
    "##### ***** Import required classes and modules"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bae63d15-4ce5-4f2a-a917-0f3161e9dd73",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dpk_fdedup.ray.transform import Fdedup"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7234563c-2924-4150-8a31-4aec98c1bf33",
   "metadata": {},
   "source": [
    "##### ***** Setup runtime parameters for this transform\n",
    "Only the parameters used in this example are described here. For the complete list of parameters, refer to the README.md for this transform:\n",
    "\n",
    "| parameter:type | value | description |\n",
    "|-|-|-|\n",
    "| input_folder:str | \\${PWD}/ray/test-data/input/ | folder that contains the input parquet files for the fuzzy dedup algorithm |\n",
    "| output_folder:str | \\${PWD}/output/ | folder that contains all the intermediate results and the output parquet files for the fuzzy dedup algorithm |\n",
    "| contents_column:str | contents | name of the column that stores the document text |\n",
    "| document_id_column:str | int_id_column | name of the column that stores the document ID |\n",
    "| num_permutations:int | 112 | number of permutations to use for the minhash calculation |\n",
    "| num_bands:int | 14 | number of bands to use for the band hash calculation |\n",
    "| num_minhashes_per_band:int | 8 | number of minhashes to use in each band |\n",
    "| operation_mode:{filter_duplicates,filter_non_duplicates,annotate} | filter_duplicates | operation mode for data cleanup: filter out duplicates, filter out non-duplicates, or annotate duplicate documents |"
   ]
  },
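  {
   "cell_type": "markdown",
   "id": "fdedup-band-arithmetic-note",
   "metadata": {},
   "source": [
    "The values in the table appear to be chosen so that `num_bands * num_minhashes_per_band` equals `num_permutations` (14 x 8 = 112). The next cell is a minimal sketch, not part of the transform's API, that checks this relationship before a run is launched. It assumes each band consumes its own disjoint slice of the minhash signature, so the product must not exceed the signature length. The variable names simply mirror the parameters in the table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fdedup-band-arithmetic-check",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sanity check only.\n",
    "# Assumption: each band uses a disjoint slice of the minhash signature.\n",
    "num_permutations = 112\n",
    "num_bands = 14\n",
    "num_minhashes_per_band = 8\n",
    "\n",
    "# Total number of minhashes that the bands consume\n",
    "used = num_bands * num_minhashes_per_band\n",
    "assert used <= num_permutations, (\n",
    "    f\"{num_bands} bands x {num_minhashes_per_band} minhashes = {used}, \"\n",
    "    f\"which exceeds {num_permutations} permutations\"\n",
    ")\n",
    "print(f\"{used} of {num_permutations} minhashes are assigned to bands\")"
   ]
  },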
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a54a78e9-d78b-4aeb-ac2b-806070a2dec0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Configure the fuzzy dedup transform with the parameters described above and run it\n",
    "Fdedup(\n",
    "    input_folder=\"ray/test-data/input\",\n",
    "    output_folder=\"output\",\n",
    "    contents_column=\"contents\",\n",
    "    document_id_column=\"int_id_column\",\n",
    "    num_permutations=112,\n",
    "    num_bands=14,\n",
    "    num_minhashes_per_band=8,\n",
    "    operation_mode=\"filter_duplicates\",\n",
    "    run_locally=True,\n",
    ").transform()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3df5adf-4717-4a03-864d-9151cd3f134b",
   "metadata": {},
   "source": [
    "##### **** The output folder now contains the transformed parquet files in its `cleaned` subfolder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7276fe84-6512-4605-ab65-747351e13a7c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import glob\n",
    "glob.glob(\"output/cleaned/*\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d30489d9-fc98-423e-90a8-e8f372787e88",
   "metadata": {},
   "source": [
    "##### ***** Print the input data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5b22234f-f7a1-4b92-b2ac-376b2545abce",
   "metadata": {},
   "outputs": [],
   "source": [
    "import polars as pl\n",
    "import os\n",
    "input_df = pl.read_parquet(os.path.join(os.path.abspath(\"\"), \"ray\", \"test-data\", \"input\", \"df1.parquet\"))\n",
    "with pl.Config(fmt_str_lengths=10000000, tbl_rows=-1):\n",
    "    print(input_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5305d127-10fd-4fa6-97a6-ac47db2bdc7e",
   "metadata": {},
   "source": [
    "##### ***** Print the output result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0b2eddb9-4fb6-41eb-916c-3741b9129f2c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import polars as pl\n",
    "output_df = pl.read_parquet(os.path.join(os.path.abspath(\"\"), \"output\", \"cleaned\", \"df1.parquet\"))\n",
    "with pl.Config(fmt_str_lengths=10000000, tbl_rows=-1):\n",
    "    print(output_df)"
   ]
  },
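  {
   "cell_type": "markdown",
   "id": "fdedup-row-count-note",
   "metadata": {},
   "source": [
    "With `operation_mode` set to `filter_duplicates`, the cleaned output should hold no more rows than the input. The next cell is a minimal sketch of that comparison. It assumes the transform above has already run and that `df1.parquet` exists in both folders, as in the two preceding cells."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fdedup-row-count-check",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import polars as pl\n",
    "\n",
    "# Paths mirror the ones used in the cells above\n",
    "input_path = os.path.join(\"ray\", \"test-data\", \"input\", \"df1.parquet\")\n",
    "output_path = os.path.join(\"output\", \"cleaned\", \"df1.parquet\")\n",
    "\n",
    "# Compare row counts before and after duplicate filtering\n",
    "n_in = pl.read_parquet(input_path).height\n",
    "n_out = pl.read_parquet(output_path).height\n",
    "print(f\"input rows: {n_in}, output rows: {n_out}, removed: {n_in - n_out}\")"
   ]
  },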
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d60e391d-cf58-47ae-9991-04c05d114edc",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "787c644e-2640-4c05-bdc2-8a261305a89f",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
